B.  INTRODUCTION

Interval-censored  (IC)  data  are  encountered  in  three  areas  of  breast  cancer  research. 
The  most  common  application  is  in  clinical  relapse  follow-up  studies  in  which  the  study 
endpoint  is  disease-free  survival.  When  a  patient  relapses,  it  is  usually  known  that  the 
relapse  takes  place  between  two  follow-up  visits,  and  the  exact  time  to  relapse  is  unknown. 
In  statistics,  we  say  relapse  time  is  interval  censored.  Interval  censoring  is  also  encountered 
in  breast  cancer  registry  studies  in  which  information  on  family  history  of  cancer  is  updated 
periodically.  The  Strang  Breast  Surveillance  Program  for  women  at  increased  risk  for  breast 
cancer,  for  instance,  has  enlisted  over  800  women  with  complete  pedigree  information  which 
is  verified  and  updated  continuously.  Family  history  data  such  as  age  at  diagnosis  of  a 
specific  cancer,  or  a  benign  but  risk-conferring  condition,  are  obtained  from  each  registrant 
at  each  update.  Time  to  a  cancer  event,  and  definitely  time  to  first  detection  of  a  benign 
condition,  are  at  best  known  to  fall  in  the  time  interval  between  the  last  update  and  age 
at  diagnosis.  A  third  but  increasingly  important  area  of  application  of  interval  censoring 
is  in  breast  cancer  chemoprevention  experiments  or  prevention  trials,  which  involve  the 
observation  of  one  or  more  surrogate  endpoint  biomarkers  (SEB)  over  time.  The  scientific 
question  of  interest  here  is  the  estimation  of  time  for  the  SEB  to  reach  a  target  value, 
and  time  from  cessation  of  intake  of  a  chemopreventive  agent  to  the  loss  of  its  protective 
effect.  Unfortunately,  the  exact  values  of  both  these  time  variables  are  known  only  to  lie  in 
between  two  successive  assay  inspection  times.  In  a  breast  cancer  follow-up  study,  we  will 
often  encounter  covariates  (for  instance,  tumor  size  and  nodal  status  in  a  relapse  study,  and 
baseline  SEB  value  in  a  chemoprevention  trial). 

Let $X$ denote a time-to-event variable with distribution function $F(x) = \Pr(X \le x)$ or, equivalently, survival function $S(x) = 1 - F(x)$. In interval censoring, $X$ is not observed and is known only to lie in an observable interval $(L, R)$. Under our previous DOD-funded grant, we made fundamental contributions to both the theory of generalized maximum likelihood (GML) estimation of $S$ and the computation needed for inference based on the GML estimator (GMLE) $\hat{S}$ of $S$. These contributions are restricted to the case of univariate interval-censored data without covariates.

The Cox proportional hazards model [1] specifies that covariates have a proportional effect on the hazard function of $X$. This model provides a powerful means of fitting failure time observations to a distribution-free model and of estimating the risk of failure associated with a vector of covariates. It is extensively used for right-censored data. Finkelstein [2] applied the Cox model to the analysis of interval-censored data. However, she did not establish asymptotic properties of the GMLE of the parameters in the model, and the approach is limited to small sample sizes because of its computational difficulty.

Our  interest  in  IC  data  with  covariates  is  driven  by  needs  arising  from  two  related 
areas  of  breast  cancer  research  at  Strang.  First,  our  investigators  in  the  Strang  Cancer 
Genetics  Program  want  to  study  various  patterns  of  familial  aggregation  of  breast,  ovarian 
and  other  forms  of  cancer  using  family  history  data  from  the  Strang  Breast  Surveillance 
Program.  Studies  of  familial  early  onset  of  breast  cancer,  breast-ovarian  and  breast-prostate 
associations will lead to IC data with covariates; therefore, a proper statistical procedure, together with feasible software to deal with such data, is very much needed. Second, we conducted a one-year chemoprevention trial of indole-3-carbinol (I3C) for breast cancer prevention. In this prevention trial we monitored the levels of two SEBs, a urinary estrogen metabolite ratio and a blood counterpart, both of which are subject to interval censoring. An earlier dose-ranging study of I3C conducted by Wong et al. [2] has been published.

The overall aim of this research proposal is to develop statistical inference for interval-censored data with covariates that are encountered in breast cancer chemoprevention trials employing surrogate endpoint biomarkers, and in breast cancer registry follow-up studies of familial aggregation of breast and other forms of cancer. Asymptotic generalized maximum likelihood theory under the Cox regression model will be investigated, and a computer software package for maximum likelihood inference will be implemented.

C.  BODY 

C.1.  Model Formulation and Likelihood Equations.

Let $Y_{K,1} < Y_{K,2} < \cdots < Y_{K,K}$ denote the follow-up times for a patient who has made $K$ follow-up visits in a longitudinal follow-up study. Since the number of visits may vary from patient to patient, $K$ is a random positive integer. For convenience, define $Y_{K,0} = 0$ and $Y_{K,K+1} = \infty$. The time-to-event variable of interest, $X$, is not directly observed; instead, it is known to lie between two successive censoring time points $(Y_{K,j}, Y_{K,j+1})$, where $j = 0, \ldots, K$. Note that $X$ is left censored if $j = 0$, strictly interval censored if $0 < j < K$, and right censored if $j = K$, that is, if $X > Y_{K,K}$. The observable interval-censored data corresponding to $X$ are given by

$$(L, R) = (Y_{K,i}, Y_{K,i+1}) \quad \text{if } Y_{K,i} < X \le Y_{K,i+1}, \quad i = 0, 1, \ldots, K. \tag{2.1}$$

In addition to $(L, R)$, we also observe a $p \times 1$ covariate vector $Z$. We assume that $K$ and the $Y_{K,j}$'s are independent of $(X, Z)$.
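The mapping from an unobserved event time to its censoring interval can be sketched in a few lines of Python. This is only an illustration: the function name `censoring_interval` is hypothetical, and the sketch assumes the half-open convention $L < X \le R$ and a strictly positive event time.

```python
import math
from bisect import bisect_left


def censoring_interval(x, visits):
    """Return (L, R) with L < x <= R, where L and R are adjacent points of
    the grid 0 = Y_0 < Y_1 < ... < Y_K < Y_{K+1} = infinity.

    Assumes x > 0 and that all visit times are positive and distinct.
    """
    grid = [0.0] + sorted(visits) + [math.inf]
    # bisect_left returns the first index j with grid[j] >= x,
    # so grid[j - 1] < x <= grid[j].
    j = bisect_left(grid, x)
    return grid[j - 1], grid[j]
```

With visits at times 1, 2 and 4, an event at 0.5 is left censored, an event at 2.5 is strictly interval censored in (2, 4), and an event at 5 is right censored.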

The Cox regression model for the survival function at $X = x$ given $Z = z$ is represented by

$$S(x \mid z) = [S_0(x)]^{e^{z\beta}},$$

where $z\beta$ is the dot product of $z$ and $\beta$, $S_0(x)$ is a baseline survival function and $\beta$ is a $p$-dimensional regression coefficient vector.
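As a small worked illustration of this model, the sketch below evaluates the covariate-adjusted survival probability from a baseline survival value. The function name `cox_survival` and the numbers in the usage note are hypothetical and are not taken from our software.

```python
import math


def cox_survival(baseline_survival, z, beta):
    """Survival probability S(x | z) = S_0(x) ** exp(z . beta).

    baseline_survival is the value S_0(x) at the time point of interest.
    """
    linear_predictor = sum(zj * bj for zj, bj in zip(z, beta))
    return baseline_survival ** math.exp(linear_predictor)
```

For example, with $S_0(x) = 0.5$, a single covariate $z = 1$ and $\beta = \ln 2$, the hazard is doubled and the survival probability becomes $0.5^2 = 0.25$; with $z = 0$ the baseline value is returned unchanged.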

Let $I_i = (L_i, R_i, Z_i)$, $i = 1, \ldots, n$, be a random sample of $n$ interval-censored observations with covariates. In terms of the original observed intervals, the likelihood function of $S$ and $b$ is given by

$$L = \prod_{i=1}^{n} \left\{ [S(L_i)]^{e^{Z_i b}} - [S(R_i)]^{e^{Z_i b}} \right\}, \tag{2.2}$$

where $S$ is a survival function and $b$ is a $p \times 1$ vector. The GMLE of $(S_0, \beta)$ is a value $(\hat{S}, \hat{b})$ that maximizes (2.2) over all survival functions $S$ and all $b \in R^p$.

Since the GMLE of $S_0$ places all probability mass on the innermost intervals of the observed intervals $(L_i, R_i)$ (see Peto (1973) or Turnbull (1976)), it is often computationally simpler to express $L$ in terms of innermost intervals.

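We say that an interval $A$ is an innermost interval of the observed intervals $(L_i, R_i)$ if $A$ is a nonempty finite intersection of one or more of these intervals such that either $(L_i, R_i) \cap A = \emptyset$ or $(L_i, R_i) \cap A = A$ for each $i$. Suppose there are a total of $m$ distinct innermost intervals $A_1, \ldots, A_m$, with right endpoints $\eta_1 < \cdots < \eta_m$, and $m \le n$. Then the likelihood function (2.2) is equivalently given by

$$L = \prod_{i=1}^{n} \left\{ \Big( \sum_{k > l_i} s_k \Big)^{e^{Z_i b}} - \Big( \sum_{k > r_i} s_k \Big)^{e^{Z_i b}} \right\}, \tag{2.3}$$

where $l_i = \sup\{j : \eta_j \le L_i\}$, $r_i = \sup\{j : \eta_j \le R_i\}$ and $s = (s_1, \ldots, s_m)$ denotes the vector of the probability weights. The log likelihood of $(s, b)$ is

$$\ell(s, b) = \sum_{i=1}^{n} \ln \left\{ \Big( \sum_{k > l_i} s_k \Big)^{e^{Z_i b}} - \Big( \sum_{k > r_i} s_k \Big)^{e^{Z_i b}} \right\}. \tag{2.4}$$

Note that $\sum_{k > l_i} s_k = 1$ if $l_i = 0$ and $\sum_{k > r_i} s_k = 0$ if $r_i = m$.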
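The innermost intervals and the indices $l_i$ and $r_i$ can be computed with a single sort of the interval endpoints. The Python sketch below is an illustration only: the function names are hypothetical, and it assumes half-open observed intervals $(L_i, R_i]$ with $L_i < R_i$, so that a right endpoint sorts before a left endpoint of equal value.

```python
import math
from bisect import bisect_right


def innermost_intervals(intervals):
    """Return the innermost intervals of half-open intervals (L, R].

    An innermost interval runs from a left endpoint to the very next
    endpoint in sorted order, provided that next endpoint is a right one.
    """
    events = []
    for left, right in intervals:
        events.append((left, 1))   # 1 marks a left (open) endpoint
        events.append((right, 0))  # 0 marks a right (closed) endpoint; sorts first on ties
    events.sort()
    found = []
    for (value1, kind1), (value2, kind2) in zip(events, events[1:]):
        if kind1 == 1 and kind2 == 0:
            found.append((value1, value2))
    return found


def mass_indices(intervals, etas):
    """For each observed interval, count the right endpoints eta_j <= L_i
    (this is l_i) and the right endpoints eta_j <= R_i (this is r_i)."""
    return [(bisect_right(etas, left), bisect_right(etas, right))
            for left, right in intervals]
```

For the observed intervals (0, 2], (1, 3] and (2, 4], the innermost intervals are (1, 2] and (2, 3], and the index pairs $(l_i, r_i)$ are (0, 1), (0, 2) and (1, 2), so the three observations carry probability $s_1$, $s_1 + s_2$ and $s_2$ when all covariates are zero.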

C.2.  Generalized  maximum  likelihood  estimation. 

A GMLE of $(s, \beta)$ is a value of $(s, b)$ that maximizes the log likelihood (2.4). We could follow the Newton-Raphson (NR) approach taken by Finkelstein [2]. However, this would involve inverting a matrix of order $(m + p - 1) \times (m + p - 1)$. Since $m$ can be large when $n$ is large, the NR algorithm is not feasible for a large data set. In our simulation studies with $n = 200$, $m$ ranges from 17 to 22.

We advocate a computationally simple approach: first group the original data $(L_i, R_i)$, and then apply a two-step iterative scheme to obtain the two-step estimators (TSE) of $S_0$ and $\beta$ based on the innermost intervals corresponding to the grouped intervals.

In the first year of our research, we have successfully implemented the computer software to calculate the TSEs of $S_0$ and $\beta$. The algorithm is summarized as follows.

1.  Partition the entire data range using $q$ time points, $q < n$. Let $(L_i^*, R_i^*)$ denote the grouped observable intervals, $i = 1, \ldots, n$. Let $s = (s_1, \ldots, s_{m_q})$ denote the vector of probability masses distributed over the $m_q \le m$ innermost intervals corresponding to the $(L_i^*, R_i^*)$'s.

2.  Maximize the likelihood of $s$ and $b$ based on the grouped data using a two-step maximization algorithm. At each iteration of the algorithm, there is an $s$-step in which the likelihood is increased by changing a transformed parameter of $s$, while $b$ is fixed at the value from the previous iteration. This is followed by a $b$-step in which the likelihood is maximized with respect to $b$, with $s$ fixed at the value updated at the current $s$-step.

For ease of presentation, we outline the algorithm in the case $m_q = 3$, so that $s = (s_1, s_2, s_3)$.

$s$-step:

a.  Transform $s$ to $s(u)$, where $s(u) = (s_1(u), s_2(u), s_3(u))$,

$$s_1(u) = \frac{s_1 + u}{1 + u}, \qquad s_2(u) = \frac{s_2}{1 + u}, \qquad s_3(u) = \frac{s_3}{1 + u},$$

and $u$ is such that $u + s_1 > 0$.

b.  Use the NR algorithm to maximize $\ell(s(u), b)$ with respect to $u$. Denote the maximizer by $u_1$. Let $s^* = s(u_1)$.

c.  Transform $s^*$ to $s^*(u)$, where $s^*(u) = (s_1^*(u), s_2^*(u), s_3^*(u))$,

$$s_1^*(u) = \frac{s_1^*}{1 + u}, \qquad s_2^*(u) = \frac{s_2^* + u}{1 + u} \qquad \text{and} \qquad s_3^*(u) = \frac{s_3^*}{1 + u}.$$

d.  Use the NR algorithm to maximize $\ell(s^*(u), b)$ with respect to $u$. Denote the maximizer by $u_2$. Let $s^{**} = s^*(u_2)$.

e.  Transform $s^{**}$ to $s^{**}(u)$, where $s^{**}(u) = (s_1^{**}(u), s_2^{**}(u), s_3^{**}(u))$,

$$s_1^{**}(u) = \frac{s_1^{**}}{1 + u}, \qquad s_2^{**}(u) = \frac{s_2^{**}}{1 + u} \qquad \text{and} \qquad s_3^{**}(u) = \frac{s_3^{**} + u}{1 + u}.$$

f.  Use the NR algorithm to maximize $\ell(s^{**}(u), b)$ with respect to $u$. Denote the maximizer by $u_3$. Let $s^{***} = s^{**}(u_3)$.

$b$-step:

Use the NR algorithm to maximize $\ell(s^{***}, b)$ with respect to $b$.

Repeat the $s$-step and $b$-step until convergence.
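The alternating scheme can be sketched in Python as follows. This is a minimal illustration and not the software described in this report: all function names are hypothetical, the data are assumed to be already reduced to index pairs $(l_i, r_i)$ with covariates $z_i$, and a derivative-free golden-section search (over a fixed bracket, one coordinate of $b$ at a time) is swapped in for the Newton-Raphson updates. Each line search is accepted only if it increases the log likelihood.

```python
import math


def log_likelihood(s, b, data):
    """Log likelihood of (s, b); data holds (l_i, r_i, z_i) triples and
    sum(s[l:]) is the mass of innermost intervals l+1, ..., m."""
    total = 0.0
    for l, r, z in data:
        power = math.exp(sum(zj * bj for zj, bj in zip(z, b)))
        upper = min(1.0, sum(s[l:])) ** power
        lower = min(1.0, sum(s[r:])) ** power
        if upper - lower <= 0.0:
            return -math.inf
        total += math.log(upper - lower)
    return total


def golden_max(f, lo, hi, iters=60):
    """Golden-section search for a maximizer of f on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - ratio * (hi - lo), lo + ratio * (hi - lo)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:
            lo, c, fc = c, d, fd
            d = lo + ratio * (hi - lo)
            fd = f(d)
        else:
            hi, d, fd = d, c, fc
            c = hi - ratio * (hi - lo)
            fc = f(c)
    return (lo + hi) / 2.0


def shift_mass(s, j, u):
    """s(u): add u to component j, then divide every component by 1 + u."""
    return [(sk + u if k == j else sk) / (1.0 + u) for k, sk in enumerate(s)]


def two_step(data, m, p, sweeps=50, tol=1e-8):
    s, b = [1.0 / m] * m, [0.0] * p
    current = log_likelihood(s, b, data)
    for _ in range(sweeps):
        previous = current
        # s-step: one line search in u per component of s, with b held fixed.
        for j in range(m):
            # Searching t = s_j(u) over (0, 1) covers every admissible u > -s_j.
            def along_t(t, j=j):
                u = (t - s[j]) / (1.0 - t)
                return log_likelihood(shift_mass(s, j, u), b, data)
            t = golden_max(along_t, 1e-9, 1.0 - 1e-9)
            value = along_t(t)
            if value > current:
                s, current = shift_mass(s, j, (t - s[j]) / (1.0 - t)), value
        # b-step: coordinate-wise search in b, with s held fixed.
        for j in range(p):
            def along_b(x, j=j):
                return log_likelihood(s, b[:j] + [x] + b[j + 1:], data)
            x = golden_max(along_b, b[j] - 2.0, b[j] + 2.0)
            value = along_b(x)
            if value > current:
                b, current = b[:j] + [x] + b[j + 1:], value
        if current - previous < tol:
            break
    return s, b, current
```

Because every update of $s$ has the form $s(u)$, the weights remain nonnegative and continue to sum to one without any explicit constraint handling, which is the point of the transformation.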

C.3.  Sensitivity  study  of  TSE. 

We  have  carried  out  Monte  Carlo  simulations  to  investigate  the  sensitivity  of  the  TSE 
of  /3  to  the  degree  of  partitioning  used  in  the  data  grouping.  Our  simulation  studies  are 
designed  as  follows: 

1.  $X$ is exponential with pdf f(x) = [...] - a)], where $I[\cdot]$ denotes the indicator function.

2.  There are 3 mutually independent covariates $Z_1$, $Z_2$ and $Z_3$, each of which is a discrete random variable with pdf f(i) = [...], $i = 1, \ldots, 6$.

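3.  $(L, R)$ is generated according to the following scheme:

$$(L, R) = \begin{cases} (0, U) & \text{if } X \le U, \\ (20, \infty) & \text{if } X > 20, \\ (U + kV,\ U + (k+1)V) & \text{if } X \le 20,\ U + kV < X \le U + (k+1)V \text{ and } k \ge 1, \end{cases}$$

where $U \sim \mathrm{Uniform}(0, 2)$ and $V \sim \mathrm{Uniform}(0, 2.3)$.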
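One replication of a design of this kind can be sketched in Python as below. Several details are assumptions made only for illustration: the covariates are taken to be uniform on 1, ..., 6, $X$ given $Z = z$ is taken to be exponential with rate $e^{z\beta}$ (unit baseline hazard, no location shift), and $k$ is allowed to start at 0 so that every $X \le 20$ falls in some interval. The function name `simulate_observation` is hypothetical.

```python
import math
import random


def simulate_observation(beta, rng):
    """Draw one censoring interval (L, R), covariate vector z and the latent x.

    The latent x is returned only so that the interval can be checked; it
    would not be available in real interval-censored data.
    """
    z = [rng.randint(1, 6) for _ in beta]        # assumed uniform on 1..6
    rate = math.exp(sum(zj * bj for zj, bj in zip(z, beta)))
    x = rng.expovariate(rate)                    # assumed unit baseline hazard
    u = rng.uniform(0.0, 2.0)
    v = rng.uniform(0.0, 2.3)
    while v == 0.0:                              # guard against a zero spacing
        v = rng.uniform(0.0, 2.3)
    if x > 20.0:
        interval = (20.0, math.inf)              # right censored at 20
    elif x <= u:
        interval = (0.0, u)                      # left censored
    else:
        k = math.ceil((x - u) / v) - 1           # u + k v < x <= u + (k + 1) v
        interval = (u + k * v, u + (k + 1) * v)
    return interval, z, x
```

A sample of size $n$ is obtained by calling the function $n$ times with a shared `random.Random` instance, for example with `beta = (-0.1, 0.2, -0.1)`.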

For each Monte Carlo simulation, a total of 1000 replications are performed. Grouping widths of 3, 5 and 8 are considered in the partitioning of the interval $[0, 20]$.

Tables 1 and 2 summarize the simulation results for sample sizes $n = 30$ and $n = 200$, respectively. For the original data (no grouping), GMLE values are given. In the 3 cases of data grouping, TSE values are listed. The figures given in parentheses are standardized differences, defined as

$$\frac{|\text{sample mean of estimator} - \text{true value}|}{\text{sample standard error of estimator}}.$$

Essentially, the TSE is quite insensitive to changes in the degree of partitioning, and sample size does not appear to be a relevant issue either. This conclusion applies to the parameter estimate of $\beta$ only. However, in assessing how close asymptotic inference based on the TSE is to that based on the GMLE, we will have to pay attention to the asymptotic covariance matrix of the TSE of $\beta$. Because the covariance matrix will be a function of the probability weights $s_1, \ldots, s_{m_q}$, it is clear that the degree of partitioning can affect the asymptotic approximation of the TSE to the GMLE in a more significant way. We will relegate this aspect of the research to the second year of our DOD grant.
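The standardized difference reported in parentheses in the tables, the absolute difference between the sample mean of the estimator and the true value divided by the sample standard error of the estimator, can be computed from the replicate estimates as in the following Python sketch. The function name is hypothetical, and the sketch assumes that the "sample standard error" is the sample standard deviation of the estimates across replications.

```python
import statistics


def standardized_difference(estimates, true_value):
    """|sample mean - true value| divided by the sample standard deviation
    of the replicate estimates (assumed meaning of 'sample standard error')."""
    mean = statistics.fmean(estimates)
    spread = statistics.stdev(estimates)
    return abs(mean - true_value) / spread
```

For instance, replicate estimates 1.0, 2.0 and 3.0 with true value 1.5 have mean 2.0 and standard deviation 1.0, giving a standardized difference of 0.5.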


data        cpu time    $\beta_1 = -0.1$    $\beta_2 = 0.2$    $\beta_3 = -0.1$

original    10 min.     -0.092              0.209              -0.092
                        (0.083)             (0.085)            (0.081)

grouped     4 min.      -0.093              0.217              -0.094
width=3                 (0.068)             (0.136)            (0.055)

grouped     4 min.      -0.094              0.239              -0.094
width=5                 (0.050)             (0.219)            (0.048)

grouped     *           *                   *                  *
width=8

*  Calculation not possible due to sparseness of data

Table 1.  Monte Carlo simulations for TSE of $\beta$ for n = 30


data        cpu time    $\beta_1 = -0.1$    $\beta_2 = 0.2$    $\beta_3 = -0.1$

original    58.3 min.   -0.087              0.191              -0.087
                        (0.419)             (0.281)            (0.406)

grouped     6.5 min.    -0.089
width=3                 (0.333)

grouped     5.6 min.    -0.090                                 -0.090
width=5                 (0.294)                                (0.286)

grouped     4.7 min.    -0.092              0.212              -0.093
width=8                 (0.205)             (0.250)            (0.171)

Table 2.  Monte Carlo simulations for TSE of $\beta$ for n = 200

Incidentally, the close approximation of the GMLE values of $\beta$ to the true $\beta$ (first row of Table 1 or Table 2) indicates that the GMLE of $\beta$ is consistent. Similarly, rows 2-4 of Table 1 or Table 2 suggest that the TSE of $\beta$ is consistent.
D.  KEY RESEARCH ACCOMPLISHMENTS IN THE FIRST YEAR

•  We have completed Task 1.
We have successfully implemented a computer program to calculate the TSE of the Cox regression coefficients $\beta$.

•  We have completed Task 2.
We have demonstrated by Monte Carlo simulations that the TSE of $\beta$ is not much affected by the degree of partitioning used in data grouping.

•  We have begun to work on Task 3.
We have demonstrated by Monte Carlo simulations that the GMLE of $\beta$ is consistent.

•  We have begun to work on Task 4.
We have demonstrated by Monte Carlo simulations that the TSE of $\beta$ is consistent.

E.  REPORTABLE  OUTCOMES 

•  A computer program to calculate the GMLE of the baseline survival function $S_0$ and that of the Cox regression coefficients $\beta$.

•  A computer program to calculate the TSE of $S_0$ and of $\beta$.

Both of these computer programs have been made available to the public via the internet site math.binghamton.edu. ftp/pub/qyu.

F.  CONCLUSIONS 

In the first year of our DOD grant, we have successfully completed our research objectives stated in Tasks 1 and 2. In addition, we have begun research work pertaining to Tasks 3 and 4. We have implemented a computer program to compute both the TSE of the baseline survival function and the TSE of the regression coefficients of the Cox regression model under interval censorship. Using Monte Carlo simulations, we have demonstrated that the TSE of the regression coefficients is consistent.

The results which we have established will be useful to breast cancer researchers pursuing chemoprevention intervention trials involving surrogate endpoint biomarkers, and to genetic epidemiologists conducting studies on familial aggregation of breast cancer and related cancers.